Skip to content

fix: make exclude patterns recursive to prevent index pollution#76

Merged
PatrickSys merged 4 commits intomasterfrom
fix/recursive-exclude-patterns
Mar 18, 2026
Merged

fix: make exclude patterns recursive to prevent index pollution#76
PatrickSys merged 4 commits intomasterfrom
fix/recursive-exclude-patterns

Conversation

@PatrickSys
Copy link
Owner

Summary

  • Extract shared EXCLUDED_DIRECTORY_NAMES and EXCLUDED_GLOB_PATTERNS into src/constants/codebase-context.ts — single source of truth for the indexer, file-watcher, and project-discovery
  • Make all exclude globs recursive (**/coverage/** instead of coverage/**) so nested occurrences in monorepo packages and worktrees are caught
  • Add missing directories: .cache, .claude, .planning, worktrees, target, vendor, .nx, .turbo, .next, build

Fixes the two high-severity findings from the consumer audit (2026-03-17, SSP_Portal):

  1. Worktrees/.claude directories leaking into the index
  2. Nested generated artifacts (packages/ui/coverage/prettify.js) indexed as real code

This is the only unique delta from the closed PR #75 (commit d1197a9).

Test plan

  • New integration test (tests/indexer-exclude-patterns.test.ts) reproduces the exact audit failure case — nested coverage/, .claude/worktrees/, worktrees/, and dist/ directories
  • Full suite passes (312/312)
  • ESLint + Prettier clean on all changed files

The indexer's exclude patterns were non-recursive (e.g. `coverage/**`),
only matching at the project root. Nested occurrences in monorepo
packages and worktrees passed through, polluting the index with
generated artifacts and worktree copies.

- Extract EXCLUDED_DIRECTORY_NAMES and EXCLUDED_GLOB_PATTERNS into
  src/constants/codebase-context.ts as the single source of truth
- Indexer, file-watcher, and project-discovery all import from there
- Add missing directories: .cache, .claude, .planning, worktrees,
  target, vendor, .nx, .turbo, .next, build
- Add integration test reproducing the consumer audit failure case
  (nested coverage/, .claude/worktrees/, worktrees/, dist/)
@greptile-apps
Copy link

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR centralises all exclude-directory logic into a single src/constants/codebase-context.ts source of truth and converts the indexer's previously non-recursive glob patterns (coverage/**) to recursive ones (**/coverage/**), fixing index pollution from nested monorepo packages and worktree checkouts.

Changes:

  • src/constants/codebase-context.ts — adds EXCLUDED_DIRECTORY_NAMES (readonly tuple), EXCLUDED_GLOB_PATTERNS (**/{dir}/** mapped from the tuple), and DISCOVERY_ONLY_IGNORED; this eliminates three separate hardcoded lists
  • src/core/indexer.ts — drops the partial, non-recursive exclude list in favour of EXCLUDED_GLOB_PATTERNS; the key fix: coverage/****/coverage/** so deeply-nested generated artifacts are now excluded
  • src/core/file-watcher.ts — adopts the shared patterns and gains previously-missing directories (build, target, vendor, .claude, worktrees)
  • src/utils/project-discovery.ts — rebuilds IGNORED_DIRECTORY_NAMES from the two shared constants; backward-compatible, only adds new entries
  • tests/indexer-exclude-patterns.test.ts — new integration test reproducing the exact audit failure; two minor issues: analyzerRegistry is not cleaned up in afterEach (potential singleton pollution if more tests are added), and the index-format fallback silently returns [] which could mask polluter-assertion failures

Issues found:

  • EXCLUDED_GLOB_PATTERNS is typed as mutable string[] rather than readonly string[], allowing callers to accidentally mutate the shared constant
  • analyzerRegistry.register in beforeEach has no corresponding cleanup, risking duplicate registrations across tests
  • The index-format fallback in the test defaults to an empty array silently, which can make the polluter assertions trivially pass without actually testing anything
  • [...EXCLUDED_GLOB_PATTERNS] spread in file-watcher.ts is a no-op copy of an already-mutable array

Confidence Score: 4/5

  • Safe to merge — the core change is correct and well-tested; only minor style and test-robustness issues remain
  • The logic change (making glob patterns recursive and centralising them) is straightforward and strictly improves coverage with no regressions. The existing consumers are all updated consistently and the behaviour for previously-excluded directories is unchanged. The one-point deduction reflects the two test issues (registry pollution and silent empty-chunk fallback) that could mask regressions in future iterations of the test suite.
  • tests/indexer-exclude-patterns.test.ts — analyzerRegistry cleanup and silent fallback; src/constants/codebase-context.ts — EXCLUDED_GLOB_PATTERNS mutability

Important Files Changed

Filename Overview
src/constants/codebase-context.ts Adds EXCLUDED_DIRECTORY_NAMES, EXCLUDED_GLOB_PATTERNS, and DISCOVERY_ONLY_IGNORED as a single source of truth; minor issue: EXCLUDED_GLOB_PATTERNS should be typed as readonly string[]
src/core/indexer.ts Replaces hardcoded non-recursive exclude patterns (e.g. node_modules/) with EXCLUDED_GLOB_PATTERNS (/{dir}/**) — core fix for nested artifact indexing; logically correct
src/core/file-watcher.ts Adopts shared EXCLUDED_GLOB_PATTERNS; adds previously-missing directories (build, target, vendor, worktrees, .claude) to the ignored list; unnecessary spread is a minor style issue
src/utils/project-discovery.ts Rebuilds IGNORED_DIRECTORY_NAMES from the two shared constants; change is backward-compatible and only adds new directories (.cache, .claude, .nx, .planning, .codebase-context, worktrees)
tests/indexer-exclude-patterns.test.ts New integration test for the audit failure case; has two issues: analyzerRegistry not cleaned up between tests (potential singleton pollution), and the index-format fallback silently produces an empty chunk list that can mask test failures

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[src/constants/codebase-context.ts] -->|EXCLUDED_DIRECTORY_NAMES| B[EXCLUDED_GLOB_PATTERNS\n**\/dir\/**]
    A -->|DISCOVERY_ONLY_IGNORED| C[project-discovery\nIGNORED_DIRECTORY_NAMES]
    B --> D[src/core/indexer.ts\nexclude: EXCLUDED_GLOB_PATTERNS]
    B --> E[src/core/file-watcher.ts\nignored: EXCLUDED_GLOB_PATTERNS]
    A -->|EXCLUDED_DIRECTORY_NAMES| C

    D -->|glob exclude| F{File path\nmatches **/dir/**?}
    E -->|chokidar ignored| F
    C -->|Set.has check| G{Directory name\nin ignored set?}

    F -->|Yes| H[Skip — not indexed / not watched]
    F -->|No| I[Index / Watch]
    G -->|Yes| J[Skip — don't recurse]
    G -->|No| K[Recurse into directory]
Loading

Last reviewed commit: "style: format with p..."

let tempDir: string;

beforeEach(async () => {
analyzerRegistry.register(new GenericAnalyzer());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 analyzerRegistry not cleaned up between tests

analyzerRegistry.register(new GenericAnalyzer()) is called in beforeEach but there is no corresponding cleanup in afterEach. Because analyzerRegistry is a module-level singleton (ES module caching means all tests in the same worker share it), each additional test added to this describe block will register another GenericAnalyzer instance, causing the registry to accumulate duplicates across runs. If analyzerRegistry has a clear() or unregister() method, call it in afterEach; otherwise, use vi.resetModules() or move the setup into the single test body.

Suggested change
analyzerRegistry.register(new GenericAnalyzer());
analyzerRegistry.clear(); // reset any previously registered analyzers
analyzerRegistry.register(new GenericAnalyzer());

(If clear() does not exist, expose one on the registry or isolate module state with vi.isolateModules.)

Comment on lines +68 to +74
const chunks = (
Array.isArray(indexRaw)
? indexRaw
: Array.isArray(indexRaw?.chunks)
? indexRaw.chunks
: []
) as Array<{ filePath: string }>;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Fragile index-format detection silently produces empty chunk list

If the on-disk format of KEYWORD_INDEX_FILENAME is neither a bare array nor { chunks: [...] }, the chunks variable silently defaults to []. In that scenario:

  • indexedPaths is []
  • The "polluter" assertions (expect(leaked).toEqual([])) all pass trivially
  • Only the "legitimate file must be indexed" assertion catches the problem

A silent pass on the polluter checks is dangerous because it looks like the exclusion works when the test is actually not exercising anything. Consider throwing an explicit error when the format is unrecognised:

Suggested change
const chunks = (
Array.isArray(indexRaw)
? indexRaw
: Array.isArray(indexRaw?.chunks)
? indexRaw.chunks
: []
) as Array<{ filePath: string }>;
const chunks = (
Array.isArray(indexRaw)
? indexRaw
: Array.isArray(indexRaw?.chunks)
? indexRaw.chunks
: (() => { throw new Error(`Unexpected index format: ${JSON.stringify(Object.keys(indexRaw ?? {}))}`); })()
) as Array<{ filePath: string }>;

Comment on lines +52 to +54
export const EXCLUDED_GLOB_PATTERNS: string[] = EXCLUDED_DIRECTORY_NAMES.map(
(dir) => `**/${dir}/**`
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 EXCLUDED_GLOB_PATTERNS should be readonly string[]

EXCLUDED_DIRECTORY_NAMES is correctly declared as const (giving it a readonly tuple type), but EXCLUDED_GLOB_PATTERNS is typed as a mutable string[]. Any consumer can accidentally push, pop, or splice into it, breaking all three callers (indexer, file-watcher, and project-discovery).

Suggested change
export const EXCLUDED_GLOB_PATTERNS: string[] = EXCLUDED_DIRECTORY_NAMES.map(
(dir) => `**/${dir}/**`
);
export const EXCLUDED_GLOB_PATTERNS: readonly string[] = EXCLUDED_DIRECTORY_NAMES.map(
(dir) => `**/${dir}/**`
);

'**/.next/**',
'**/.cache/**'
],
ignored: [...EXCLUDED_GLOB_PATTERNS],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unnecessary spread of an already-mutable array

EXCLUDED_GLOB_PATTERNS is already typed as string[], so spreading it into a new array literal is redundant — it just allocates a copy.

Suggested change
ignored: [...EXCLUDED_GLOB_PATTERNS],
ignored: EXCLUDED_GLOB_PATTERNS,

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 31fdda5df3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

'.nx',
'.planning',
'.turbo',
'build',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not drop nested build scripts from indexing

Adding 'build' to the shared recursive exclude list means mergeConfig() now passes **/build/** to scanFiles() (src/core/indexer.ts:278,1033-1039) and startFileWatcher() ignores the same paths (src/core/file-watcher.ts:46-50). In monorepos that keep checked-in tooling under packages/*/build/ or similar, those source files will silently disappear from search and auto-refresh even though the old build/** rule only skipped the repository root’s build output.

Useful? React with 👍 / 👎.

- Make EXCLUDED_GLOB_PATTERNS readonly to prevent accidental mutation
  by consumers (spreads at call sites are now intentional, creating
  mutable copies for APIs that require string[])
- Throw on unrecognized index format in test instead of silently
  defaulting to empty array (prevents polluter assertions from passing
  vacuously)
- Move analyzerRegistry.register into test body — only one test, no
  need for beforeEach ceremony
Set.has() requires the argument to match the set's type parameter.
Spreading as-const tuples into a Set infers a narrow literal union,
which rejects entry.name (plain string) at the call site on line 178.
@PatrickSys PatrickSys merged commit a814b24 into master Mar 18, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant